Digitised historical text: Does it have to be mediOCRe?

Authors

  • Beatrice Alex
  • Claire Grover
  • Ewan Klein
  • Richard Tobin
Abstract

This paper reports on experiments to improve the Optical Character Recognition (OCR) quality of historical text as a preliminary step in text mining. We analyse the quality of OCRed text compared to a gold standard and show how it can be improved by performing two automatic correction steps. We also demonstrate the impact this can have on named entity recognition in a preliminary extrinsic evaluation. This work was performed as part of the Trading Consequences project, which is focussed on text mining of historical documents for the study of nineteenth century trade in the British Empire.

Similar resources

VARD 2: A tool for dealing with spelling variation in historical corpora

Spelling variation causes considerable problems for corpus linguistic techniques such as frequency analysis, concordancing and automatic tagging, with a significant impact being made on recall and the accuracy of results [1]. This paper will focus on Early Modern English, the most recent period of the English language to include a large amount of inconsistent spelling. Although many corpora of ...

iDoc: Interactive Analysis, Transcription and Translation of Old Text Documents TIN2006-15694-C02

There are huge historical document collections residing in libraries, museums and archives that are currently being digitised for preservation purposes and to make them available worldwide through large, on-line digital libraries. The main objective, however, is not to simply provide access to raw images of digitised documents, but to annotate them with their real informative content and, in pa...

Named Entity Recognition for Digitised Historical Texts

We describe and evaluate a prototype system for recognising person and place names in digitised records of British parliamentary proceedings from the late 17th and early 19th centuries. The output of an OCR engine is the input for our system and we describe certain issues and errors in this data and discuss the methods we have used to overcome the problems. We describe our rule-based named enti...

Margins are more important than text, Historical values of margins, memorial notes and colophons of Manuscripts in Zoroastrian tradition

In the Zoroastrian tradition, the most important challenge and the most ambiguous issue is the ambiguity of history and the neglect of time and chronology. Perhaps the ambiguity of history in the Zoroastrian tradition stems from the view that historical time is limited, that the beginning and end of time are clear, and that goodness will eventually prevail. Since early times, the Zoroastrian re...

Poetic Affection in Historical prose of Nafsat ol-Masdoor

Nafsat ol-Masdoor is one of the outstanding artificial and technical prose texts that remain from the Mongol era. The text was written not only to record historical events; the writer also intended to relate his own biography and problems. The writer uses various literary arts to affect the addressee, and one of these componen...

Publication date: 2012